Tech Arsenal 1

home *** CD-ROM | disk | FTP | other *** search

/ Tech Arsenal 1 / Tech Arsenal (Arsenal Computer).ISO / tek-02 / msortp.zip / MSORTP.DOC < prev next >

Wrap

Text File | 1993-03-02 | 25KB | 572 lines

MSORTP - Merge Sort Unit for Protected and Real Mode Kim Kokkonen TurboPower Software January 1993 Introduction --------------------------------------------------------------------- MSORTP is a unit for sorting an unlimited number of items in real or protected mode applications. It can be used with TPW 1.0, TPW 1.5, or BP7 for DOS real mode, DOS protected mode, or Windows targets. Because it takes advantage of PChar strings, MSORTP cannot be used with earlier Turbo Pascal compilers. Although MSORTP isn't source-compatible with Borland's old LSORT unit from the Database Toolbox, its interface is quite similar. Hence, it is a good candidate to replace LSORT in programs being upgraded to new platforms. MSORTP takes advantage of all available heap space, including the extended memory heap when running in protected mode. When it runs out of heap space, it uses a merge sort algorithm that stores temporary intermediate files on disk. MSORTP is an excerpt from TurboPower Software's commercial product B-Tree Filer. Using MSORTP --------------------------------------------------------------------- The following small program shows how to implement a text file sort filter using MSORTP. {$R-,S-,X+,I+} program TextSort; {-Protected mode text file sort filter} uses Strings, MSortP; var LineBuf : array[0..1024] of Char; MaxHeap : LongInt; MinHeap : LongInt; Status : Word; procedure SendToSortEngine; far; var P : PChar; begin while not Eof do begin {Read the next line from the input file as a PChar} ReadLn(LineBuf); {Don't sort empty lines to avoid StrNew quirk} if StrLen(LineBuf) <> 0 then begin {Copy string to heap} P := StrNew(LineBuf); {Store the pointer in the sort engine} if not PutElement(P) then Exit; end; end; end; function Less(var X, Y) : Boolean; far; begin Less := (StrComp(PChar(X), PChar(Y)) < 0); end; procedure GetFromSortEngine; far; var P : PChar; begin {Return each pointer and write the associated string to output} while GetElement(P) do WriteLn(P); end; begin {Determine how much heap space MergeSort can use} MaxHeap := MemAvail div 10; MinHeap := 4*MinimumHeapToUse(SizeOf(PChar)); if MaxHeap < MinHeap then MaxHeap := MinHeap; {Perform the sort} Status := MergeSort(MaxHeap, SizeOf(PChar), SendToSortEngine, Less, GetFromSortEngine, DefaultMergeName); if Status <> 0 then WriteLn('Sort failed, Status ', Status); end. This program provides three routines for the use of MSORTP. SendToSortEngine reads lines of text from the standard input device, allocates space for each line on the heap, and passes the string pointer to MSORTP by calling PutElement. Note that only the pointer is passed to MSORTP, since there is no need for MSORTP to keep a second copy of a string that is already stored in memory. The Less function compares pairs of elements for MSORTP. This routine simply uses the StrComp function from Borland's STRINGS unit to perform a case-sensitive comparison of two PChars. Note the typecast from the untyped VAR parameters Less receives to the PChar data type stored in the sort engine. Finally, GetFromSortEngine retrieves the elements in sorted order. In this program, GetElement return a pointer to the actual string data. When the $X+ compiler directive is enabled, as it is here, the WriteLn statement writes the string pointed to by each pointer P. The main block of the program calls the MergeSort function to perform the sort. First it must decide how to partition the heap between the needs of string storage and the needs of MSORTP. MSORTP is given about 10% of the heap and the rest is reserved for text strings. Since the sort engine is storing 4 byte elements, this is a balanced split if the average string length is 40 bytes. Also note that the MinimumHeapToUse function is called to guarantee that the sort engine gets at least 4 times as much heap space as it requires at a minimum. In this way merging is minimized and the sort will always succeed unless there is insufficient disk space or insufficient heap space to hold the input strings. The second parameter passed to MergeSort is the size of the elements to be sorted, in this case 4 bytes, the size of a PChar. The next three parameters are the routines used to pass elements to the sort engine, to compare elements, and to get sorted elements back from the engine. The final parameter is a routine used to name each merge file. In this program we use the default routine DefaultMergeName provided by MSORTP, which stores the merge files in the current directory. MergeSort returns a Word result indicating the status of the sort. The result is zero to indicate success, or a non-zero result to explain the reason for a failure. TMSORTP.PAS, also in this archive, is a more extensive program for testing the accuracy and speed of the MSORTP unit. Interfaced Identifiers of MSORTP --------------------------------------------------------------------- The following sections describe the constants, types, procedures, and functions interfaced by the MSORTP unit. Constants ========= MaxSelectors = 256; Maximum number of selectors allocated. MergeSort allocates memory buffers using segments of up to 65535 bytes, but the size is always an even power of two times the sort record length. As a result, the default value of MaxSelectors allows MergeSort to directly access 8-16 megabytes of extended memory. If 8 megabytes is insufficient, you may want to increase MaxSelectors to 512. If 8 megabytes is more than enough, you may want to decrease MaxSelectors. The data segment space MSORTP uses for selectors (2*MaxSelectors bytes) dominates the data segment usage of the unit. MedianThreshold = 16; The partition length below which the in-memory quick sort simply uses the middle element of the partition for the pivot element. For partition lengths at least this size, MergeSort uses the median of the left, middle, and right elements for the pivot. The median-of-three algorithm significantly improves the speed of large in-memory sorts for input that is already sorted, reverse sorted, or nearly sorted. MergeOrder = 5; Input files used at a time during merge. You can set MergeOrder to any value in the range from 2 to 10 inclusive. However, experimentation indicates that the default value of 5 is optimal under a wide range of conditions. MinRecsPerRun = 4; Minimum number of records that must fit in memory during a sort. If fewer records fit in memory, MergeInfo and MergeSort return an error code. If even MinRecsPerRun records fit in memory, MergeSort performs merging to complete the sort. Increase this constant if you prefer that the sort fail instead of doing an excessive amount of merging. SwapThreshold = 64; The record size below which MergeSort swaps complete data records. For records SwapThreshold bytes or larger, MergeSort swaps pointers to records instead of the records themselves. Swapping pointers is the faster approach for large records sorted in memory, but this approach has a memory overhead of 4 bytes per record plus a buffer segment that must be used for a run output buffer. The default of 64 was chosen to keep the typical overhead below 10%. Reducing the default also provides no significant improvement in performance. Types ===== ElementIOProc = procedure; Specifies the type of the routine passed as the SendToSortEngine and GetFromSortEngine parameters to MergeSort. These routines must be declared FAR and must have no parameters. ElementCompareFunc = function (var X, Y) : Boolean; Specifies the type of the routine passed as the Less parameter to MergeSort. MergeSort calls this function to compare pairs of elements as needed. It must be declared FAR and must have the form shown here. It should return True if and only if element X is strictly less than element Y. You should typecast the untyped parameters to treat them as elements of the type you are sorting. MergeNameFunc = function (Dest : PChar; MergeNum : Word) : PChar; Specifies the type of the routine passed as the MergeName parameter to MergeSort. MergeSort calls this function to obtain the name of each merge file when needed. MSORTP interfaces a default MergeNameFunc that can be used in many circumstances. This routine, DefaultMergeName, returns a name of the form 'SORnnnnn.TMP', where nnnnn is the merge file number given by MergeNum. A MergeNameFunc must return a unique file name for each value of MergeNum. Typically you specify a non-default MergeNameFunc if you want the merge files to be located on a different drive or directory than the current one. MergeInfoRec = record SortStatus : Word; {Predicted status of sort, assuming disk ok} MergeFiles : Word; {Total number of merge files created} MergeHandles : Word; {Maximum file handles used} MergePhases : Word; {Number of merge phases} MaxDiskSpace : LongInt; {Maximum peak disk space used} HeapUsed : LongInt; {Heap space actually used} SelectorCount: Word; {Number of selectors allocated} RecsPerSel : Word; {Records stored in each selector} end; Describes the structure returned by the MergeInfo function. This function predicts the status of a sort and its resource usage given certain information about it. See MergeInfo for more information. Procedures and Functions ======================== Declaration procedure AbortSort; Purpose Halts a sort prematurely. Description Call this routine from your Less, SendToSortEngine, or GetFromSortEngine routines to abort a sort. The Less function must always return False if it calls AbortSort. When you call AbortSort, MergeSort always returns a status of 1. See Also MergeSort Declaration function DefaultMergeName(Dest : PChar; MergeNum : Word) : PChar; Purpose Returns a default name for each merge file. Description The default merge name is SORnnnnn.TMP, where nnnnn corresponds to the MergeNum. For example, for MergeNum = 1, DefaultMergeName returns 'SOR1.TMP'. For MergeNum = 999, DefaultMergeName returns 'SOR999.TMP'. For MergeNum = 65535 (the largest possible value), DefaultMergeName returns 'SOR65535.TMP'. DefaultMergeName stores the null-terminated string in the buffer pointed to by Dest, and it returns Dest as the function result. MergeSort provides an 80 character buffer each time it calls the merge name function. You can write your own merge name function that puts the merge files somewhere besides the current directory. Follow the example of DefaultMergeName and pass your function to MergeSort. See Also MergeSort Declaration function GetElement(var X) : Boolean; Purpose Returns the next element in sorted order. Description Call this routine repeatedly in your GetFromSortEngine routine to retrieve the sorted elements. GetElement returns True until there are no more sorted elements to retrieve. GetElement copies the next element into the variable you pass as the parameter X. Be sure that this variable is large enough to hold an entire record; otherwise GetElement will overwrite memory. When GetElement returns False, the parameter X is not initialized. See Also PutElement Declaration procedure MergeInfo(MaxHeapToUse : LongInt; RecLen : Word; NumRecs : LongInt; var MI : MergeInfoRec); Purpose Predicts status and resource usage of a merge sort. Description MaxHeapToUse is the maximum number of bytes of heap space the sort should use. MergeInfo actually allocates heap space up to this amount; if there is less heap space available, the MergeInfo results apply only to the available heap space. RecLen is the size in bytes of each record to be sorted. NumRecs is the total number of records to be sorted. MI returns information about the sort. MI.SortStatus is zero if the sort is predicted to succeed. MergeInfo assumes that there is sufficient disk space and that no disk errors will occur. MI.MergeFiles is the total number of merge files that will be created. MI.MergeHandles is the total number of file handles used. This will always be in the range of 0 to MergeOrder+1 inclusive. MI.MergePhases is the number of merge phases. A value of 0 indicates that the sort can be done completely in memory. 1 indicates that MergeOrder or fewer merge files are created and will be merged in one pass. Higher values mean that multiple merge passes are required, with the output from earlier passes feeding the input of later passes. MI.MaxDiskSpace is the peak disk space required. Since merge files are deleted as soon as they are used, the disk space used in a merge sort grows and shrinks. All merge files are deleted when the sort is complete. MaxDiskSpace is always smaller than 2*RecLen*NumRecs. The analysis that MergeInfo performs to determine MaxDiskSpace requires that MI.MergeFiles be smaller than 16384, and that 4*MI.MergeFiles bytes of heap space be free when MergeInfo is called. If these requirements aren't met, MergeInfo returns -1 for MI.MaxDiskSpace. MI.HeapUsed is the number of bytes of heap space the sort will actually use. This is always less than or equal to MaxHeapToUse. MI.SelectorCount is the number of segments of heap space the sort will allocate. MI.RecsPerSel is the number of records stored in each segment. This is always a power of two. See Also MinimumHeapToUse OptimumHeapToUse Declaration function MergeSort(MaxHeapToUse : LongInt; RecLen : Word; SendToSortEngine : ElementIOProc; Less : ElementCompareFunc; GetFromSortEngine : ElementIOProc; MergeName : MergeNameFunc) : Word; Purpose Sorts a group of elements. Description MaxHeapToUse specifies the maximum number of bytes of heap space the sort will use. It is not an error for MaxHeapToUse to exceed MemAvail; MergeSort will use whatever is available. If you know in advance how many records will be sorted, it is a good idea to pass the result returned by OptimumHeapToUse for this parameter. RecLen is the number of bytes in each record to be sorted. SendToSortEngine is a procedure you write that passes the sort elements into the sort engine. Your procedure calls PutElement for each element to be sorted. If PutElement returns False, an error has occurred (usually out of disk space) and you should exit your SendToSortEngine procedure. Less is a function you write that compares pairs of elements. This function must return True if and only if element "X" (the first parameter) is strictly less than element "Y" (the second parameter). GetFromSortEngine is a procedure you write that retrieves the sorted elements from the sort engine. Your procedure calls GetElement to retrieve each element in sorted order. When GetElement returns False, all elements have been retrieved or an error has occurred. MergeName is a function that provides a name for each merge file. You can often pass DefaultMergeName for this parameter. DefaultMergeName puts all of the merge files in the current directory at the time of the merge sort. MergeSort returns a status code in its function result. It can return the following values: 0 success 1 user abort (AbortSort was called) 8 insufficient memory to sort 106 invalid input parameter (RecLen zero, MaxHeapToUse too small) 204 invalid pointer returned by GlobalLock, or SelectorInc <> 8 213 no elements available to sort 214 more than 65535 merge files else DOS or Turbo Pascal error code The most common Turbo Pascal error code returned by MergeSort is 101, which means that there was insufficient disk space to store the merge files. If you want to predict whether a sort can succeed before actually attempting it, use the MergeInfo procedure. See Also DefaultMergeName GetElement MergeInfo PutElement Declaration function MinimumHeapToUse(RecLen : Word) : LongInt; Purpose Returns the minimum heap space that allows MergeSort to succeed. Description Given the size of each record (RecLen), MinimumHeapToUse returns the smallest amount of heap space that will allow a sort to succeed. You can pass this value to MergeSort to sort a group of elements using the smallest amount of memory. Note that the value returned by MinimumHeapToUse is often very small and can cause a significant amount of merging, so it's generally better to multiply the result by a reasonable factor (say 2-4) even if you want to minimize heap usage of a sort. Because of a quirk in the BP7 DPMI heap manager, MinimumHeapToUse adds 2048 bytes to the actual computed heap requirement. This is important because sometimes the DPMI heap manager consumes about 2000 bytes of heap space for its internal use. See Also OptimumHeapToUse Declaration function OptimumHeapToUse(RecLen : Word; NumRecs : LongInt) : LongInt; Purpose Returns the smallest heap space for a sort with no merging. Description Given the size of each record (RecLen) and the number of records to be sorted (NumRecs), OptimumHeapToUse returns the amount of heap space needed to perform the sort entirely in memory. Additional heap space does not help the sort. Less heap space causes merging. Because of a quirk in the BP7 DPMI heap manager, OptimumHeapToUse adds 2048 bytes to the actual computed heap requirement. This is important because sometimes the DPMI heap manager consumes about 2000 bytes of heap space for its internal use. See Also MinimumHeapToUse Declaration function PutElement(var X) : Boolean; Purpose Submits an element to the sort system. Description Call this function within your SendToSortEngine routine for each element to be sorted. Pass the element to be sorted in the untyped parameter X. PutElement returns True if the element is successfully processed by MergeSort. It returns False if an error occurred; do not continue to call PutElement in this case. See Also GetElement How MSORTP Works --------------------------------------------------------------------- The MergeSort function allocates a group of identical segments by calling GlobalAlloc repeatedly. (When MSORTP is compiled for DOS real mode, it provides a substitute routine for GlobalAlloc that is just a shell around the normal Turbo Pascal GetMem routine.) Each segment holds an even power-of-two number of records. The segments collectively use no more than MaxHeapToUse bytes. The last record element of the group of segments is used to store a pivot element. The next to last record element is usually used for an element swap buffer (see below). The remaining elements are collectively called the run buffer and are used for an in-memory sort of each "run" of elements. The "run" must number at least MinRecsPerRun or the sort will not proceed. MergeSort also chooses the segment size so that there are at least MergeOrder+1 segments, since these segments are reused as buffers later during the merge process. To avoid the use of complicated code for writing and reading buffers and an excessive number of LongInt operations, each segment must be smaller than 65536 bytes. Subject to the restrictions already mentioned, however, the segment size is maximized. Hence, typical segment sizes are 32768-65535 bytes. If the segment size is 32768, the maximum memory MergeSort can access is 8MB for the default value of MaxSelectors (256). If this situation adversely effects the performance of a sort, MaxSelectors can be increased to 512 to provide access to the full 16MB of RAM that the DPMI manager can use. The cost is 512 additional bytes of data segment space. Conversely, if a sort will never need to access 8MB of memory, MaxSelectors can be reduced and data segment usage will decrease proportionately. Elements are obtained when SendToSortEngine calls PutElement, which adds elements to the run buffer. When the run buffer fills, or when there are no more elements, the run buffer is sorted in memory using a non-recursive quicksort which calls the Less function to compare elements. If all elements fit into a single run buffer, no merging is required and GetFromSortEngine retrieves the elements directly from the sorted run buffer by calling GetElement. If there are more elements than fit in one run buffer, the sorted run buffer is written to a merge file assigned a sequential number. Merge files are created until all elements have been obtained and all run buffers sorted. When all merge files have been created, the merge process starts. The first MergeOrder segments of memory are reused as input buffers for reading from up to MergeOrder input files opened simultaneously. The next segment is used as an output buffer for storing the merged result. Any other allocated segments are not used during merging. Up to MergeOrder merge files are opened for input and a bufferful is read from each merge file into its buffer. The first unused element in each buffer is compared and the smallest is sent to the output. When the elements in an input file are exhausted, the file is closed and deleted. When all input files are gone, the output file is closed. The output file then becomes part of the input merge sequence for the next merge phase. When fewer than MergeOrder input files remain, each merged output element is passed directly back to the caller via the GetElement routine instead of writing it to another merge file. In this way, MergeSort avoids a final pass of writing to disk and reading back the records. Due to a quirk of the BP7 DPMI heap manager, MergeInfo or MergeSort may appear to leak heap space. Both routines deallocate all of the segments that they initially allocate. The quirk is that, after the DPMI heap manager makes a small number of allocations, it allocates about 2000 bytes of memory for itself and does not release this memory even if all of the user allocations are later released. Hence, MergeInfo may return a value of HeapUsed about 2000 bytes larger than needed, and MemAvail may be 2000 bytes smaller after you call MergeInfo or MergeSort. OptimumHeapToUse and MinimumHeapToUse also return a value 2048 bytes bigger than may be needed because of this quirk. When MergeSort performs an in-memory quick sort of a run, the size of the data record determines whether MergeSort swaps pointers to records or the complete records. When record size is greater than or equal to SwapThreshold (default 64), MergeSort swaps pointers. Otherwise, it swaps entire records. When it swaps pointers, MergeSort must allocate an additional 4 bytes of space for each record to hold the swap pointer. It must also reserve one segment for writing the sorted run to a merge file. Given the default SwapThreshold value of 64, the memory overhead of swapping pointers is typically less than 10%. Swapping pointers gives large gains in sort speed (2x or more) for large records that can be sorted completely in memory. Version History --------------------------------------------------------------------- Version 5.40 1/93 Initial release. Version number synchronized with that of B-Tree Filer. Copyright and License Information --------------------------------------------------------------------- MSORTP is copyright (c) 1993 by TurboPower Software. All Rights Reserved. Although the MSORTP files are copyrighted, you may use and distribute them freely as long as you do not sell them or include them with other software that you sell. The latest version of MSORTP is always available in LIB 6 of the PCVENB forum of CompuServe. It's also available on the TurboPower BBS at 719-260-9726. Look for MSORTP.ZIP. MSORTP is an excerpt from TurboPower Software's product B-Tree Filer. B-Tree Filer is a collection of routines for managing single user and network databases, as well as for accessing common network functions on Novell and NetBIOS networks. Download FILER.BRO from LIB 6 of PCVENB for more information, or give TurboPower a call. You can reach TurboPower at: TurboPower Software P.O. Box 49009 Colorado Springs, CO 80949-9009 719-260-6641 (voice, Monday-Friday 9AM-5PM) 719-260-7151 (fax) 719-260-9136 (sales) 719-260-9726 (BBS) Compuserve: 76004,2611